Until now, most generative music models, MusicGen included, have produced mono sound. That means no stereo mix with sounds or instruments placed on the left and right, which makes the result less vibrant and thrilling. Stereo has been largely overlooked because it is a genuinely challenging task to generate.
As musicians, we can place each instrument track wherever we want in a stereo mix. MusicGen, however, does not generate separate instruments; it produces one combined audio signal, and creating convincing stereo without access to the individual instrument sources is difficult. Unfortunately, splitting a mixed audio signal back into its individual sources (source separation) is a complex problem that is not yet fully solved.
To address this, Meta decided to incorporate stereo generation directly into the MusicGen model, training it on a new dataset of stereo music so that it produces stereo outputs. The researchers claim that generating stereo adds no extra compute cost compared to mono.
Although the paper does not describe the stereo procedure in detail, my understanding is that MusicGen generates two compressed audio signals (a left and a right channel) instead of one mono signal. These compressed signals are then decoded separately and combined into the final stereo output. The process does not take twice as long because MusicGen can produce the two compressed signals in roughly the same time it previously needed for one.
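To make that idea concrete, here is a minimal sketch of how such a decoding step could look. This is only my interpretation of the paper, not Meta's actual implementation; the `lm`, `encodec`, and `generate_codes` names are hypothetical placeholders.

```python
import torch

def generate_stereo(lm, encodec, text_prompt):
    """Hypothetical sketch: one language-model pass yields two token streams,
    which are decoded separately and stacked into a stereo waveform."""
    # A single pass produces compressed audio tokens for both channels,
    # which is why stereo is claimed to add no extra compute over mono.
    left_codes, right_codes = lm.generate_codes(text_prompt)

    # Each token stream is decoded to a mono waveform with the EnCodec decoder...
    left_wav = encodec.decode(left_codes)    # shape: [batch, 1, samples]
    right_wav = encodec.decode(right_codes)  # shape: [batch, 1, samples]

    # ...and the two mono waveforms are concatenated along the channel axis,
    # giving the final stereo output.
    return torch.cat([left_wav, right_wav], dim=1)  # shape: [batch, 2, samples]
```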
The ability to produce convincing stereo sound sets MusicGen apart from other state-of-the-art models like MusicLM or Stable Audio. This addition significantly enhances the liveliness of the generated music. Listen for yourselves (might be hard to hear on smartphone speakers):
Mono
Stereo
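If you want to generate a comparison like this yourself, the open-source audiocraft library exposes the stereo checkpoints alongside the mono ones. Below is a rough usage sketch based on the library's standard MusicGen interface; the checkpoint name 'facebook/musicgen-stereo-small' is the one published at the time of writing, so check the model card if it has changed.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a stereo checkpoint (swap in 'facebook/musicgen-small' for the mono baseline).
model = MusicGen.get_pretrained('facebook/musicgen-stereo-small')
model.set_generation_params(duration=10)  # seconds of audio to generate

# Generate from a text prompt; the result is a tensor of shape [batch, channels, samples].
wav = model.generate(['warm lo-fi beat with soft keys and vinyl crackle'])

# Write each generated clip to disk as a loudness-normalized WAV file.
for idx, clip in enumerate(wav):
    audio_write(f'stereo_demo_{idx}', clip.cpu(), model.sample_rate, strategy='loudness')
```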
MusicGen was impressive upon its release, but Meta’s FAIR team has kept improving it, delivering higher-quality and more authentic results. Among text-to-music models that generate audio signals, MusicGen surpasses its competitors (as of November 2023).
Furthermore, since MusicGen and its related products (EnCodec, AudioGen) are open-source, they serve as an incredible source of inspiration and a go-to framework for aspiring AI audio engineers. Looking at the improvements made to MusicGen in just six months, it is exciting to imagine what 2024 will bring.
Another important aspect is that Meta’s transparent approach provides foundational work for developers looking to integrate this technology into software for musicians. Whether it’s generating samples, brainstorming musical ideas, or exploring different genres, the applications are already starting to show. With sufficient transparency, we can ensure that AI enhances the excitement of creating music instead of being perceived solely as a threat to human musicianship.